Baseball was the first major sport to embrace analytics. The game lends itself to recording data: it has a small number of possible states and a small number of possible outcomes at any given moment.

One could do a term project on the development of analytics in sports, from Bill James, to Moneyball, to the present Statcast era.

Rules Overview

For those unfamiliar with the game, here are the basics:

  • Nine players per side (with substitutions available)
  • Nine innings per game. Each inning has two halves.
  • In each half inning one team bats, the other plays the field.
  • The basic unit of action in baseball is the pitch:
    • The pitcher pitches a ball over home plate to be received by the catcher.
    • The batter tries to hit the pitch into play.
    • If the ball is hit into play, the batter tries to advance to 1st base (or beyond!).
  • Possible outcomes of any one pitch:
    • ball
    • hit by pitch
    • swinging strike
    • called strike
    • foul ball
    • hit into play
  • The hitting team scores a run when a player advances around the bases to return to home plate.
  • The team with the most runs after 9 innings wins. (No ties, so ‘extra’ innings if necessary.)
  • The scoreboard:
Team      1  2  3  4  5  6  7  8  9    R   H  E
visitor   0  1  2  0  0  1  0  0  0    4  13  1
home      2  1  0  0  3  0  1  1  x    8   9  0
  • The 9 defensive positions of the fielding team:

source: whatisbaseball.com/wp-content/uploads/2019/04/Diagram-of-BB.png
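The R column of the scoreboard in the rules list above is just the sum of the inning-by-inning runs. Here is a minimal Python sketch of that check. The function name `total_runs` is made up for illustration, and the reading of `x` (the home team, already ahead, did not bat in the bottom of the 9th) is the usual scoreboard convention rather than something stated in these notes.

```python
def total_runs(innings):
    """Sum a team's inning-by-inning runs; 'x' marks a half inning that was not played."""
    return sum(int(r) for r in innings if r != "x")

# Line scores copied from the scoreboard above
visitor = "0 1 2 0 0 1 0 0 0".split()
home = "2 1 0 0 3 0 1 1 x".split()

print(total_runs(visitor), total_runs(home))  # 4 8
```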

Major League Baseball

MLB currently has 30 teams, which are divided into two leagues of 15 teams, the American League (AL) and the National League (NL). The teams in each league are divided into 3 divisions. We list below each team by its affiliation in this league structure.

In a given season each team plays 162 games, and then the top 6 teams in each league make the playoffs, where teams square off in best-of-3, best-of-5, or best-of-7 series. The champion of the American League plays the champion of the National League in the World Series. The winner of the World Series is the champion of MLB. The only current team never to have played in the World Series is my Seattle Mariners. The number of times each team has won the World Series is given in parentheses.

American League

AL East
Baltimore Orioles (3)*
Boston Red Sox (9)
New York Yankees (27)
Tampa Bay Rays (0)
Toronto Blue Jays (2)

AL Central
Chicago White Sox (3)
Cleveland Guardians (2)
Detroit Tigers (4)
Kansas City Royals (2)
Minnesota Twins (3)

AL West
Houston Astros (2)
Los Angeles Angels (1)
Seattle Mariners (0)
Texas Rangers (1)
The Athletics (9)

National League

NL East
Atlanta Braves (4)
Miami Marlins (2)
New York Mets (2)
Philadelphia Phillies (2)
Washington Nationals (1)

NL Central
Chicago Cubs (3)
Cincinnati Reds (5)
Milwaukee Brewers (0)
Pittsburgh Pirates (5)
St. Louis Cardinals (11)

NL West
Arizona Diamondbacks (1)
Colorado Rockies (0)
Los Angeles Dodgers (8)
San Diego Padres (0)
San Francisco Giants (8)

* number of World Series titles

Baseball Statistics: A Crash Course

Major League Baseball (MLB) statistics have been kept for over 100 years, and these statistics can be found in the Lahman data sets. Examples of many of these counting statistics can be found in the videos linked above.

The box score provides a statistical summary of each player’s game performance. Retrosheet is a nice site with historical baseball statistics, including game-by-game results in the old box score format: see the basic box score at Retrosheet.

Glossary of Statistics

The Lahman package provides a glossary of statistics appearing in its data sets, which can be viewed in RStudio’s help menu (lower right pane - click on Help tab and search for Lahman), assuming you’ve loaded the Lahman package into your session.

The glossary of standard statistics at the MLB site has thorough descriptions of these statistics.

Rate Statistics for Hitters

The goal in baseball is to win games. To win games, you need to score more runs than the other team. So, we want hitters that help teams create runs, and we want fielders and pitchers that prevent the other team from scoring runs. From the standard counting statistics, we can create statistics to measure how much a hitter contributes to the creation of runs on offense. Here we look at traditional measures of hitting performance.

AVG - Batting average

Batting average is hits divided by at bats: it measures the proportion of at bats that result in a base hit.

AVG = H / AB

In MLB, a batting average of .300 is very good!
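As a quick illustration of the formula, here is a minimal Python sketch. The function name `batting_average` and the season totals are hypothetical, and returning `None` for a hitter with no at bats is just one reasonable way to handle the empty case.

```python
def batting_average(hits, at_bats):
    """AVG = H / AB; returns None when there are no at bats."""
    if at_bats == 0:
        return None
    return hits / at_bats

# Hypothetical season: 165 hits in 550 at bats
avg = batting_average(165, 550)
print(f"{avg:.3f}")  # 0.300
```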

Batting average is a traditionally important statistic: the player with the highest batting average in a season is said to have won the batting title. Furthermore, the rare triple crown winner is a hitter who leads the league in batting average, home runs, and runs batted in. AVG had cachet, baby!

Batting average only tells part of the story of a hitter’s contribution to team runs. There are ways to contribute to team runs without getting a hit (like earning a walk, hitting a sacrifice fly, getting hit by a pitch, stealing a base, etc.). Moreover, not all hits contribute the same amount to a team’s runs (clearly a home run is worth more than a single). Batting average is a clumsy statistic for assessing a player’s real contribution to run creation.

OBP - On-base percentage

Teams need base runners to score runs. A batter’s on-base percentage essentially measures the proportion of their plate appearances in which they made it safely to base by their own “skill” (\(H\) - hit, \(BB\) - walk (aka base on balls), or \(HBP\) - hit by pitch).

OBP = (H + BB + HBP)/(AB + BB + HBP + SF)

In MLB an OBP of .400 is super, .350 is solid, and .320 is about average.

Like batting average, on-base percentage does not distinguish between the type of hit a batter gets. For instance, OBP equates the value of hitting a home run to the value of hitting a single.
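The OBP formula can be sketched in Python as follows. The function name `obp` and the season totals are hypothetical. The example is chosen to show a hitter whose batting average looks ordinary but whose walks push OBP well above it.

```python
def obp(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF); None if the denominator is zero."""
    denom = ab + bb + hbp + sf
    if denom == 0:
        return None
    return (h + bb + hbp) / denom

# Hypothetical season: 150 H, 70 BB, 5 HBP, 520 AB, 5 SF
print(round(150 / 520, 3))                 # AVG: 0.288
print(round(obp(150, 70, 5, 520, 5), 3))   # OBP: 0.375
```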

SLG - Slugging percentage

Slugging percentage is a weighted batting average, where each type of hit is weighted by how many bases the hitter takes:

\[SLG = \frac{\text{(singles)} + 2*\text{(doubles)} + 3*\text{(triples)} + 4* \text{(home runs)}}{\text{at bats}}\]

Usually, slugging percentage is calculated via the formula below since it is traditional not to record singles, but total hits instead, along with 2B, 3B, and HR:

SLG = (H + 2B + 2*3B + 3*HR)/AB

A slugging percentage above .500 is considered very good for MLB.
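The two SLG formulas agree because singles = H - 2B - 3B - HR. A small Python sketch can confirm this on a hypothetical stat line. The function names and the numbers are illustrative only.

```python
def slg_from_hit_types(singles, doubles, triples, home_runs, at_bats):
    """Weighted form: each hit type counts for the number of bases taken."""
    return (singles + 2 * doubles + 3 * triples + 4 * home_runs) / at_bats

def slg_from_totals(hits, doubles, triples, home_runs, at_bats):
    """Traditional form: total hits plus the extra bases beyond a single."""
    return (hits + doubles + 2 * triples + 3 * home_runs) / at_bats

# Hypothetical season: 160 H, 30 2B, 5 3B, 25 HR, 550 AB
h, d, t, hr, ab = 160, 30, 5, 25, 550
singles = h - d - t - hr  # 100 singles

a = slg_from_hit_types(singles, d, t, hr, ab)
b = slg_from_totals(h, d, t, hr, ab)
print(a, b)  # 0.5 0.5
```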

OPS - On base plus slugging

OPS is a simple way to combine how likely the batter is to get on base (OBP) along with weighting the different hits according to how many bases the hits earned (SLG). This statistic generally does a much better job of predicting team runs than batting average. Here’s the formula:

OPS = OBP + SLG

An OPS above .800 is above average, and the very best of hitters can achieve an OPS above .950 or 1.000.
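To see how OPS is assembled from counting statistics, here is a minimal Python sketch. The `BattingLine` class, its field names, and the season totals are all hypothetical. It simply applies the OBP and SLG formulas given earlier and adds the results.

```python
from dataclasses import dataclass

@dataclass
class BattingLine:
    """Season counting stats for one hitter (field names are illustrative)."""
    ab: int
    h: int
    doubles: int
    triples: int
    hr: int
    bb: int
    hbp: int
    sf: int

    @property
    def obp(self) -> float:
        return (self.h + self.bb + self.hbp) / (self.ab + self.bb + self.hbp + self.sf)

    @property
    def slg(self) -> float:
        return (self.h + self.doubles + 2 * self.triples + 3 * self.hr) / self.ab

    @property
    def ops(self) -> float:
        return self.obp + self.slg

line = BattingLine(ab=550, h=160, doubles=30, triples=5, hr=25, bb=60, hbp=5, sf=5)
print(f"OBP {line.obp:.3f}  SLG {line.slg:.3f}  OPS {line.ops:.3f}")
# OBP 0.363  SLG 0.500  OPS 0.863
```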

wOBA - Weighted on-base average

We’ll consider this statistic in Lab 1!

Rate Statistics for Pitchers

ERA - Earned Run Average

Pitchers have two runs-allowed statistics, R and ER. Roughly speaking, R (runs) counts the number of runs that scored against the pitcher, and ER (earned runs) counts the runs that score against a pitcher without the benefit of an error or a passed ball.

ERA is a simple rate statistic that gives the number of earned runs the pitcher allows on a per 9 innings pitched basis:

ERA = ER / IP * 9

For pitchers, who try to prevent runs from scoring, the lower the better for ERA. A season ERA of 3.00 (meaning they allow 3 earned runs for every 9 innings they pitched, on average) is quite good in MLB.
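Here is a minimal Python sketch of the ERA calculation. It takes outs recorded rather than innings as input, on the assumption that the data stores innings as outs (the Lahman Pitching table has an `IPouts` column for this). That avoids any confusion with box score notation such as 6.1 for six and one-third innings. The function name and the totals are hypothetical.

```python
def era(earned_runs, outs_recorded):
    """ERA = ER / IP * 9, with IP = outs / 3; None if no outs were recorded."""
    if outs_recorded == 0:
        return None
    innings = outs_recorded / 3
    return 9 * earned_runs / innings

# Hypothetical season: 60 earned runs over 180 innings (540 outs)
print(era(60, 540))  # 3.0
```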

WHIP - Walks plus Hits per Innings Pitched

This rate statistic gives a measure of how many base runners a pitcher allows (by walk (BB) or hit (H)), on a per inning basis:

WHIP = (BB + H) / IP

In MLB a WHIP of 1.00 is very good, and the lower the WHIP the better!
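WHIP follows the same pattern. In this Python sketch, innings are again derived from outs recorded, which is an assumption about how the data is stored. The function name and the season totals are hypothetical.

```python
def whip(walks, hits, outs_recorded):
    """WHIP = (BB + H) / IP, with IP = outs / 3; None if no outs were recorded."""
    if outs_recorded == 0:
        return None
    innings = outs_recorded / 3
    return (walks + hits) / innings

# Hypothetical season: 40 BB and 140 H over 180 innings (540 outs)
print(whip(40, 140, 540))  # 1.0
```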

K9 and K% (Strikeouts per 9 innings and strikeout rate)

Strikeouts are lovely things for a pitcher, and two common rate stats for strikeouts are

K9 = SO / IP * 9

which gives a pitcher’s strikeout totals on a per 9 inning basis.

A better measure of a pitcher’s strikeout ability is the strikeout rate, which is simply the proportion of the batters faced (BF) that they strike out.

K% = SO / BF

I like K% better than K9 as a measure of pitching performance, because the denominator in K% is the actual number of strikeout opportunities (BF), while the denominator in K9 is not. Let’s take a simple example.

Pitcher A pitches 30 innings (and gets 90 outs). In those 30 innings they face 150 batters, and strike out 25 of them. Pitcher A has a K9 of 25/30*9 = 7.5. They struck out 7.5 batters for every 9 innings they pitched.

Pitcher B also pitches 30 innings. In those 30 innings they face 100 batters, striking out 20 of them. Pitcher B has a K9 of 20/30*9 = 6.0.

But Pitcher A faced a lot more batters than Pitcher B (meaning they had a harder time getting outs, right?). Facing more batters means they have more opportunities to strike out a batter.

Pitcher A’s strikeout rate, K%, equals 25/150 = .167.

Pitcher B only faced 100 batters (getting 90 outs much more efficiently than Pitcher A), and their K% is 20/100 = .200.

Pitcher B was better at getting strikeouts (and outs!)
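The Pitcher A versus Pitcher B comparison can be reproduced with a few lines of Python. The numbers come from the example above. The function names `k9` and `k_pct` are just illustrative.

```python
def k9(strikeouts, innings):
    """Strikeouts per 9 innings pitched."""
    return 9 * strikeouts / innings

def k_pct(strikeouts, batters_faced):
    """Proportion of batters faced that were struck out."""
    return strikeouts / batters_faced

pitchers = {
    "A": {"ip": 30, "bf": 150, "so": 25},
    "B": {"ip": 30, "bf": 100, "so": 20},
}

for name, p in pitchers.items():
    print(name, k9(p["so"], p["ip"]), round(k_pct(p["so"], p["bf"]), 3))
# A 7.5 0.167
# B 6.0 0.2
```

Pitcher A leads in K9, but Pitcher B leads in K%, which is the ranking the discussion above argues for.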

An analogous example in basketball: which is a better measure of 3-point shooting ability, recording 3-point shots made per game or 3-point shooting percentage (3-point shots made divided by total attempts)? The second measure is clearly better: ‘per game’ is the wrong denominator for determining shooting ability because it ignores how many times a player attempts a 3-point shot.

BB9 and BB% (Walks per 9 innings and Walk rate)

A good pitcher strikes out a lot of hitters and walks very few! Both K% and BB% (as well as the difference K-BB%) are important measures of pitching performance.

BB9 = BB / IP * 9
BB% = BB / BF

The same discussion applies here: BB% is a better measure of avoiding walks than BB9, because its denominator (BF) counts the actual opportunities to issue a walk.
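A short Python sketch of the walk statistics, along with the difference K-BB% mentioned above. The function names and the season totals are hypothetical.

```python
def bb9(walks, innings):
    """Walks per 9 innings pitched."""
    return 9 * walks / innings

def bb_pct(walks, batters_faced):
    """Proportion of batters faced that were walked."""
    return walks / batters_faced

def k_minus_bb_pct(strikeouts, walks, batters_faced):
    """K% minus BB%, computed over the same batters-faced denominator."""
    return (strikeouts - walks) / batters_faced

# Hypothetical season: 180 IP, 740 BF, 200 SO, 45 BB
print(bb9(45, 180))                              # 2.25
print(round(bb_pct(45, 740), 3))                 # 0.061
print(round(k_minus_bb_pct(200, 45, 740), 3))    # 0.209
```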

FIP (Fielding Independent Pitching)

Charging hits and runs to a pitcher doesn’t take into account the defense. So the rate statistics ERA and WHIP depend somewhat on the variability of the fielders’ performance (and positioning) behind the pitcher. The statistic FIP measures a pitcher’s performance based on three plate appearance outcomes that do not depend on fielding: strikeouts, walks, and home runs. The FIP formula has a constant c (which changes each year) built into it so that the value of FIP is on the same scale as ERA (to make FIP an easier number to interpret). The FanGraphs guts page has the FIP constant for each season (the cFIP column). The cFIP constant is usually close to 3.1.

FIP = (13*HR + 3*BB - 2*SO)/IP + c

So, like ERA, a FIP of 3.00 is usually excellent, and 4.00 is closer to average. In 2024, the league average FIP was 4.08 (and league average ERA was also 4.08).
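Here is a minimal Python sketch of the FIP formula as written above. The function name and the season totals are hypothetical. The default `c=3.1` is only a placeholder based on the "usually close to 3.1" remark, so for a real calculation the season's actual constant should be looked up and passed in.

```python
def fip(home_runs, walks, strikeouts, innings, c=3.1):
    """FIP = (13*HR + 3*BB - 2*SO) / IP + c.

    c is the season-specific FIP constant; 3.1 is a placeholder, not a real season's value.
    """
    return (13 * home_runs + 3 * walks - 2 * strikeouts) / innings + c

# Hypothetical season: 20 HR, 50 BB, 200 SO over 180 innings
print(round(fip(20, 50, 200, 180), 2))  # 3.16
```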

Statcast statistics

Since around 2015, MLB has recorded physical data surrounding baseball plays, which has provided a wealth of new data to consider when analyzing player performance. Information about these statistics can be found at the Baseball Savant glossary.

A peek at Statcast Data

Possible states

Baseball has a small number (24) of possible game states at the start of each plate appearance:
outs  3rd base  2nd base  1st base
0     0         0         0
0     0         0         1
0     0         1         0
0     0         1         1
0     1         0         0
0     1         0         1
0     1         1         0
0     1         1         1
1     0         0         0
1     0         0         1
1     0         1         0
1     0         1         1
1     1         0         0
1     1         0         1
1     1         1         0
1     1         1         1
2     0         0         0
2     0         0         1
2     0         1         0
2     0         1         1
2     1         0         0
2     1         0         1
2     1         1         0
2     1         1         1
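The count of 24 comes from 3 possible out counts (0, 1, 2) times 2 × 2 × 2 ways the three bases can be empty or occupied. A minimal Python sketch that enumerates the states in the same order as the table is below. The variable name `states` is illustrative.

```python
from itertools import product

# Each state is (outs, 3rd base, 2nd base, 1st base), with 1 = runner on that base.
# product() varies the last position fastest, which matches the table's row order.
states = list(product(range(3), (0, 1), (0, 1), (0, 1)))

print(len(states))   # 24
print(states[0])     # (0, 0, 0, 0)
print(states[-1])    # (2, 1, 1, 1)
```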

After we cover probability theory we will return to these states to consider the likelihood of scoring runs in each of them.